UNL Document Summarization
Abstract
This paper proposes an approach to UNL document summarization. Our approach employs both the surface and the semantic information of the UNL annotation to summarize documents. Thanks to the semantic annotation of UNL, the essence of the document is collected efficiently, which facilitates the abstraction function for language generation. Multilinguality can also be realized through the language deconverters, which generate the target languages from the summarized UNL document under the UNL framework. The experimental results show that summarization quality improves when the UNL annotation is used instead of the original plain text.

Introduction

The UNL project ([8]) has been conducted under the aegis of the United Nations University, Japan, since 1996. It is a collaborative effort of research institutions from 16 countries. UNL aims to be an international semantic annotation standard for network-oriented multilingual communication. The UNL framework provides a means of representing the meaning of a natural language document with a set of semantic graphs. This paper introduces a summarization method for UNL documents that yields better summaries. Rather than employing only the superficial information, we directly process the UNL semantic information to extract the essence of the document. Our work shows that using the UNL annotation improves summarization quality.

1 UNL specification

Existing interlingua-based machine translation systems translate the source language into an interlingua and then translate the interlingua into the target language. Errors made in creating the interlingua propagate to target language generation. This drawback of the interlingual approach has impeded its practical use. To improve translation accuracy, the UNL project proposes a new paradigm in which users directly prepare interlingual documents, called UNL documents, as the source documents.
As a result, the source for target language generation is a flawless interlingua. In support of this framework, UNL documents are designed to contain no semantic ambiguities.

UNL is a project for multilingual network communication initiated by the United Nations University, Japan. It is based on an interlingual approach in which meaning is represented by a hypergraph. A UNL graph consists of nodes and links. A node is formed by a universal word (UW) with a list of attributes attached (such as @entry, indicating the entry node of the UNL graph; @pl, indicating the plurality of the concept; and @def, indicating the definiteness of the concept). A link is a directed arc labeled with the semantic relation between the two nodes it connects. A UNL document is a text encoding a set of UNL graphs. More details on UNL can be found in [1], [4], [5] and [8]. Figures 1 and 2 show an example of a UNL graph and the corresponding UNL text.

2 Universal words

A UW denotes an interlingual acceptation used for concept representation in UNL. Theoretically, a UW has only one meaning; in other words, UWs do not allow semantic ambiguity. English words are employed in UW construction because (i) English is known by all UNL developers, and (ii) many good bilingual dictionaries between local languages and English are available ([5]). A UW is expressed as an English headword followed by a parenthesized list of restrictions, e.g. book(icl>do, obj>room). Restrictions are composed of the following constraints:

1) icl (which stands for inclusion) is the restriction defining the semantic class in which the UW is included. A part of the UNL class hierarchy is shown in Figure 3. For example, “car(icl>movable thing)” indicates that this UW is in the class of movable thing.
2) Any semantic relation available for the UNL arcs can be used, together with a UNL class name, to restrict the meaning of the English headword.
For example, eat(agt>volitional thing, obj>food) indicates that the agent of this UW is restricted to UWs in the class of volitional thing and that its object is restricted to UWs in the class of food.

[Figure 1: An example of a UNL graph for “The bachelor books a room for two persons.” Nodes: book(icl>do,obj>room).@entry, bachelor(icl>man).@def, room(icl>space), person(icl>body).@pl, 2. Arc labels: agt, obj, pur, qua.]

obj(book(icl>do, obj>room).@entry, room(icl>space))
agt(book(icl>do, obj>room).@entry, bachelor(icl>man).@def)
pur(room(icl>space), person(icl>volitional thing).@pl)
qua(person(icl>volitional thing).@pl, 2)

Figure 2: The UNL text encoding the UNL graph in Figure 1.

[Figure 3: A part of the UNL class hierarchy]

3 UNL annotation and text summarization

Most existing work on text summarization, such as [2], [3] and [7], relies on the surface information of documents. Using this surface information, these approaches select the best sentences and list them together to summarize the whole text. Because they do not use semantic information, these approaches have a serious drawback: the generated summaries are often hard to read and contain many redundancies. For UNL documents, however, the UNL semantic information is very useful for generating high-quality summaries.

3.1 Advantages of UNL document summarization

3.1.1 Multilinguality

Because UNL provides an interlingual expression framework, the summary of a UNL document can be generated in many target languages without any additional work. The deconverter generates the desired target languages from the summarized UNL document.

3.1.2 Unambiguity

A UNL document does not allow semantic ambiguity in its annotation, so summarizing a UNL document ensures high quality and clarity. For example, when summarizing a document on plants, an ambiguity may arise as to whether "plant" means a factory or a tree.
If the document is annotated in UNL, however, different concepts are represented by different UWs, and this ambiguity is resolved. The problem of counting different senses of a word together in the statistics is consequently avoided.

3.1.3 Semantic information

Rather than employing only the superficial surface information, we also employ the deep semantic information to summarize UNL documents. This semantic information improves the quality of the summary: with it, we can remove redundancy and combine sentences into a more meaningful and readable one.

Table 1: The 5 best sentences selected for summarization.

No. 1
English: UNL represents the means to facilitate multilingual communication on the information network.
UNL expression:
aoj(represent.@entry.@pred.@present, UNL)
obj(represent.@entry.@pred.@present, means.@def)
met(facilitate.@pred, means.@def)
obj(facilitate.@pred, communication)
mod(communication, multilingual.@indef)
mod(communication, network.@def)
mod(network.@def, information)

No. 2
English: The language exists only on the information network.
UNL expression:
obj(exist.@entry.@pred.@present, language.@def)
lpl(exist.@entry.@pred.@present, network.@def)
mod(network.@def, only)
mod(network.@def, information)

No. 3
English: UNL is a global-scale common language, being transparent to all languages.
UNL expression:
aoj(language.@entry.@pred.@present.@indef, UNL)
aoj(global-scale, language.@entry.@pred.@present.@indef)
aoj(common, language.@entry.@pred.@present.@indef)
aoj(transparent.@pred, language.@entry.@pred.@present.@indef)
ben(transparent.@pred, language:02.@pl)
mod(language:02.@pl, all)

No. 4
English: Information encoded in UNL is converted to an equivalent counterpart written in the target language, through a language generator "deconvertor" prepared for each language.
UNL expression:
obj(encode.@pred, information)
met(encode.@pred, UNL)
obj(convert.@entry.@pred.@present, information)
gol(convert.@entry.@pred.@present, counterpart.@indef)
aoj(equivalent, counterpart.@indef)
obj(write.@pred, counterpart.@indef)
met(write.@pred, language:01.@def)
mod(language:01.@def, target)
met(convert.@entry.@pred.@present, generator.@indef)
mod(generator.@indef, language:02)
cnt(generator.@indef, deconvertor)
obj(prepare.@pred, generator.@indef)
ben(prepare.@pred, language:03)
mod(language:03, each)

No. 5
English: Complying with the same technical standards, these computer networks comprise the Internet.
UNL expression:
aoj(technical, standard.@pl.@def)
mod(standard.@pl.@def, same)
gol(comply.@pred, standard.@pl.@def)
mod(network.@pl, computer)
mod(network.@pl, these)
man(comprise.@pred.@present.@entry, comply.@pred)
aoj(comprise.@pred.@present.@entry, network.@pl)
obj(comprise.@pred.@present.@entry, Internet.@def)

3.2 UNL document summarization

UNL document summarization consists of four main steps. First, a score is calculated for each UNL sentence. Second, the n-best sentences for summarization are selected according to this score. Third, the redundant words or phrases in the selected sentences are removed using the UNL semantic information. Finally, some of the selected sentences are combined to improve readability and naturalness.

3.2.1 Calculating sentence-score

In order to select the best sentences for summarization, a score is calculated for each sentence. A sentence score is calculated from the weights of the words constituting the sentence. The weight of each word is computed according to its term frequency and inverse document frequency ([6]) as follows.
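The weighting formula itself is not reproduced in this text, so the following Python sketch only illustrates the general idea of scoring UNL sentences by the tf-idf weights of their UWs and keeping the n-best. It is not the authors' implementation. The assumptions are: a common smoothed tf-idf variant, a sentence score equal to the sum of its UW weights, a background corpus given as sets of UWs, and numeric suffixes such as ":02" treated as instance markers. All function names and the toy data are hypothetical.

```python
import math
import re
from collections import Counter


def parse_relation(line):
    """Split 'rel(arg1, arg2)' into (rel, arg1, arg2), respecting nested parentheses."""
    label, _, rest = line.strip().partition("(")
    body = rest[:-1]  # drop the closing parenthesis of the relation
    depth = 0
    for i, ch in enumerate(body):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:  # the comma separating the two nodes
            return label, body[:i].strip(), body[i + 1:].strip()
    raise ValueError(f"not a binary relation: {line!r}")


def node_to_uw(node):
    """Strip attributes (.@entry, .@pl, ...) and, by assumption, ':NN' instance suffixes."""
    uw = node.split(".@")[0]
    return re.sub(r":\d+$", "", uw)


def sentence_uws(relations):
    """Collect the UWs of one UNL sentence, counting each graph node once."""
    seen, uws = set(), []
    for line in relations:
        _, left, right = parse_relation(line)
        for node in (left, right):
            if node not in seen:
                seen.add(node)
                uws.append(node_to_uw(node))
    return uws


def uw_weights(doc_uws, corpus):
    """Assumed weighting: tf in the document times log((N + 1) / (df + 1))."""
    tf = Counter(uw for sentence in doc_uws for uw in sentence)
    n_docs = len(corpus)
    return {
        uw: freq * math.log((n_docs + 1) / (sum(uw in d for d in corpus) + 1))
        for uw, freq in tf.items()
    }


def n_best(doc, corpus, n):
    """Return the indices of the n highest-scoring sentences, in document order."""
    doc_uws = [sentence_uws(sentence) for sentence in doc]
    weights = uw_weights(doc_uws, corpus)
    scores = [sum(weights[uw] for uw in uws) for uws in doc_uws]  # assumed: sum of weights
    ranked = sorted(range(len(doc)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n])


# Toy document: each sentence is a list of UNL relation lines (simplified from Table 1).
doc = [
    ["obj(exist.@entry, language.@def)"],
    ["mod(network.@def, information)", "mod(network.@def, only)"],
    ["aoj(common, language.@entry)", "aoj(transparent, language.@entry)"],
]
# Hypothetical background corpus: one set of UWs per document.
corpus = [{"language", "exist", "common"}, {"language"}, {"network"}]

print(n_best(doc, corpus, 2))
```

Returning the selected indices in document order keeps the summary sentences in their original sequence, which matters for the later redundancy-removal and sentence-combination steps.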